Abstract

Recovering three-dimensional information from two-dimensional images is the fundamental goal of 
stereo techniques. The problem of recovering depth (three-dimensional information) from a set of 
images is essentially the correspondence problem: Given a point in one image find the corresponding 
point in each of the other images. Finding potential correspondences usually involves matching some 
image property. If the images are from nearby positions, they will vary only slightly, simplifying the 
matching process. 

Once a correspondence is known, solving for the depth is simply a matter of geometry. Real images 
are composed of noisy, discrete samples; therefore the calculated depth will contain error. This error 
is a function of the baseline or distance between the images. Longer baselines result in more precise 
depths. This leads to a conflict: short baselines simplify the matching process but produce imprecise 
results; long baselines produce precise results but complicate the matching process. 

In this paper, we present a method for generating dense depth maps from large sets (thousands) of 
images taken from arbitrary positions. Long baseline images improve the accuracy. Short baseline 
images and the large number of images greatly simplify the correspondence problem, removing 
nearly all ambiguity. The algorithm presented is completely local and for each pixel generates an 
evidence versus depth and surface normal distribution. In many cases, the distribution contains a 
clear and distinct global maximum. The location of this peak determines the depth and its shape can 
be used to estimate the error. The distribution can also be used to perform a maximum likelihood fit 
of models directly to the images. We anticipate that the ability to perform maximum likelihood 
estimation from purely local calculations will prove extremely useful in constructing three-dimensional 
models from large sets of images. 
1 Introduction

Recovering three-dimensional information from 
two-dimensional images is the fundamental goal 
of stereo techniques. The problem of recovering 
the missing dimension, depth, from a set of im- 
ages is essentially the correspondence problem: 
Given a point in one image, find the corresponding
point in each of the other images. Finding potential
correspondences usually involves matching some image
property in two or more images. If the images are from
nearby positions, they will vary only slightly,
simplifying the matching process.




Figure 1: Stereo calculation.

Once a correspondence is known, solving for depth is
simply a matter of geometry. Real images are noisy, and
measurements taken from them are also noisy. Figure 1
shows how the depth of point $P$ can be calculated given
two images taken from known cameras $C_1$ and $C_2$ and
corresponding points $p_1$ and $p_2$ within those images,
which are projections of $P$. The location of $p_1$ in the
image is uncertain; as a result $P$ can lie anywhere
within the left cone. A similar situation exists for
$p_2$. If $p_1$ and $p_2$ are corresponding points, then $P$
could lie anywhere in the shaded region. Clearly, for a
given depth, increasing the baseline between $C_1$ and
$C_2$ will reduce the uncertainty in depth. This leads to
a conflict: short baselines simplify the matching
process, but produce uncertain results; long baselines
produce precise results, but complicate the matching
process.
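
The trade-off can be illustrated numerically. The sketch below is not taken from the paper: it assumes the simplest possible geometry, a rectified pair with parallel optic axes, where depth is $z = f b / d$ for focal length $f$ (in pixels), baseline $b$ and disparity $d$. The function names and the numeric values are hypothetical and chosen only for illustration.

```python
def depth_from_disparity(f_px, baseline, disparity_px):
    """Depth of a point seen by a rectified, parallel-axis stereo pair (assumed model)."""
    return f_px * baseline / disparity_px


def depth_interval(f_px, baseline, depth, pixel_error):
    """Range of depths consistent with a disparity known only to +/- pixel_error."""
    d = f_px * baseline / depth  # ideal disparity at this depth
    near = depth_from_disparity(f_px, baseline, d + pixel_error)
    far = depth_from_disparity(f_px, baseline, d - pixel_error)
    return near, far


if __name__ == "__main__":
    f_px, depth, err = 1000.0, 50.0, 0.5  # illustrative values only
    for b in (0.1, 1.0, 10.0):
        near, far = depth_interval(f_px, b, depth, err)
        print(f"baseline {b:5.1f}: depth in [{near:.2f}, {far:.2f}]")
```

With the same half-pixel localization error, the interval of consistent depths shrinks as the baseline grows, which is the behaviour the shaded region in Figure 1 describes.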

One popular set of approaches for dealing with this
problem is relaxation techniques¹ [6, 9]. These methods
are generally used on a pair of images: they start with
an educated guess for the correspondences, then update
them by propagating constraints. These techniques don't
always converge and don't always recover the correct
correspondences. Another approach is to use multiple
images. Several researchers, such as Yachida [11], have
proposed trinocular stereo algorithms. Others have also
used special camera configurations to aid in the
correspondence problem [10, 1, 8]. Bolles, Baker and
Marimont [1] proposed constructing an epipolar-plane
image from a large number of images. In some cases,
analyzing the epipolar-plane image is much simpler than
analyzing the original set of images. The epipolar-plane
image, however, is only defined for a limited set of
camera positions. Tsai [10] and Okutomi and Kanade [8]
defined a cost function which was applied directly to a
set of images. The extremum of this cost function was
then taken as the correct correspondence. Occlusion is
assumed to be negligible. In fact, Okutomi and Kanade
state that they "invariably obtained better results by
using relatively short baselines." This is likely the
result of using a spatial matching metric (a correlation
window) and ignoring perspective distortion. Both methods
used small sets of images, typically about ten. They also
limited camera positions to special configurations. Tsai
used a localized planar configuration with parallel optic
axes, and Okutomi and Kanade used short linear
configurations. Cox et al. [2] proposed a
maximum-likelihood framework for stereo pairs, which they
have extended to multiple images. This work attempts to
explicitly model occlusions, although in a somewhat ad
hoc manner. It uses a few global constraints and small
sets of images.

The work presented here also uses multiple images and
draws its major inspiration from Bolles, Baker and
Marimont [1]. We define a construct called an epipolar
image and use it to analyze evidence about depth. Like
Tsai [10] and Okutomi and Kanade [8] we define a cost
function that is applied across multiple images, and like
Cox et al. [2] we model the occlusion process. There are
several important differences,



1 For a more complete and detailed analysis of this and 
other techniques see [5, 7, 4]. 



however. The epipolar image we define is valid for
arbitrary camera positions and models some forms of
occlusion. Our method is intended to recover dense depth
maps of built geometry (architectural facades) using
thousands of images acquired from within the scene. In
most cases, depth can be recovered using purely local
information, avoiding the computational costs of global
constraints. Where depth cannot be recovered using purely
local information, the depth evidence from the epipolar
image provides a principled distribution for use in a
maximum-likelihood approach [3].



2 Our Approach 

In this section, we review epipolar geometry and
epipolar-plane images, then define a new construct
called an epipolar image. We also discuss the
construction and analysis of epipolar images.
Stereo techniques typically assume that relative 
camera positions and internal camera calibra- 
tions are known. This is sufficient to recover 
the structure of a scene, but without additional 
information the location of the scene cannot be 
determined. We assume that camera positions 
are known in a global coordinate system such as 
might be obtained from GPS (Global Positioning 
System). Although relative positions are suffi- 
cient for the discussion in this section, global po- 
sitions allow us to perform reconstruction incre- 
mentally using disjoint scenes. We also assume 
known internal camera calibrations. The nota- 
tion we use is defined in Table 1. 
Table 1: Notation used in this paper.



Figure 2: Epipolar geometry.



2.1 Epipolar Geometry

Epipolar geometry provides a powerful stereo constraint.
Given two cameras with known centers $C_1$ and $C_2$ and a
point $P$ in the world, the epipolar plane $\Pi_e$ is
defined as shown in Figure 2. $P$ projects to $p^1$ and
$p^2$ on image planes $\Pi^1$ and $\Pi^2$ respectively. The
projection of $\Pi_e$ onto $\Pi^1$ and $\Pi^2$ produces
epipolar lines $\ell_e^1$ and $\ell_e^2$. This is the
essence of the epipolar constraint. Given any point $p$
on epipolar line $\ell_e^1$ in image $\Pi^1$, if the
corresponding point is visible in image $\Pi^2$, then it
must lie on the corresponding epipolar line $\ell_e^2$.
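
The constraint can be checked numerically. The following is a minimal sketch rather than the authors' implementation: it assumes ideal pinhole cameras, and the `Camera` class, its `yaw` parameter and the helper names are hypothetical. The epipolar line in the second image is obtained by projecting two points of the viewing ray of the first image; every other point on that ray should then project onto the same line.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


class Camera:
    """Ideal pinhole camera: a center plus orthonormal right/up/forward axes."""

    def __init__(self, center, yaw=0.0, focal=1.0):
        self.center = center
        self.focal = focal
        # Rotate the default axes (looking along +z) about the y axis by `yaw`.
        c, s = math.cos(yaw), math.sin(yaw)
        self.right, self.up, self.forward = (c, 0.0, -s), (0.0, 1.0, 0.0), (s, 0.0, c)

    def project(self, point):
        """Image coordinates of a world point assumed to lie in front of the camera."""
        v = sub(point, self.center)
        z = dot(v, self.forward)
        return (self.focal * dot(v, self.right) / z,
                self.focal * dot(v, self.up) / z)

    def back_project(self, pixel, depth):
        """World point on the ray through `pixel`, at `depth` along the optic axis."""
        x, y = pixel[0] / self.focal, pixel[1] / self.focal
        return tuple(self.center[k] + depth * (x * self.right[k]
                                               + y * self.up[k]
                                               + self.forward[k])
                     for k in range(3))


def epipolar_line(cam1, pixel1, cam2):
    """Line (a, b, c) in image 2, scaled so that |a*x + b*y + c| is a distance."""
    a = cam2.project(cam1.back_project(pixel1, 1.0))
    b = cam2.project(cam1.back_project(pixel1, 2.0))
    line = cross((a[0], a[1], 1.0), (b[0], b[1], 1.0))
    norm = math.hypot(line[0], line[1])
    return tuple(t / norm for t in line)


def distance_to_line(line, pixel):
    return abs(line[0] * pixel[0] + line[1] * pixel[1] + line[2])


if __name__ == "__main__":
    cam1 = Camera((0.0, 0.0, 0.0))
    cam2 = Camera((2.0, 0.0, 0.5), yaw=-0.3)
    p1 = (0.1, -0.2)
    line = epipolar_line(cam1, p1, cam2)
    for depth in (3.0, 5.0, 10.0):
        p2 = cam2.project(cam1.back_project(p1, depth))
        print(depth, distance_to_line(line, p2))  # distances are ~0 up to rounding
```

Whatever the depth of $P$ along the ray, its projection stays on the one line, so a search for the corresponding point is one-dimensional.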




Figure 3: Epipolar-plane image geometry.



2.2 Epipolar-Plane Images

Bolles, Baker and Marimont [1] used the epipolar
constraint to construct a special image which they
called an epipolar-plane image. As noted earlier, an
epipolar line $\ell_e^{i,k}$ contains all of the
information about the epipolar plane $\Pi_e^k$ that is
present in the $i$th image $\Pi^i$. An epipolar-plane
image is built using all of the epipolar lines
$\{\ell_e^{i,k}\}$ from a set of images $\{\Pi^i\}$ which
correspond to a particular epipolar plane $\Pi_e^k$
(Figure 3). Since all of the lines $\{\ell_e^{i,k}\}$ in an
epipolar-plane image $\mathcal{EP}_k$ are projections of
the same epipolar plane $\Pi_e^k$, for any given point $p$
in $\mathcal{EP}_k$, if the corresponding point in any
other image $\Pi^i$ is visible, then it will also be
included in $\mathcal{EP}_k$. Bolles, Baker and Marimont
exploited this property to solve the correspondence
problem for several special cases of camera motion. For
example, with images taken at equally spaced points
along a linear path perpendicular to the optic axes,
corresponding points form lines in the epipolar-plane
image; therefore finding correspondences reduces to
finding lines in the epipolar-plane image.

For a given epipolar plane $\Pi_e^k$, only those images
whose camera centers lie on $\Pi_e^k$
($\{C_i \mid C_i \cdot \Pi_e^k = 0\}$) can be included in
epipolar-plane image $\mathcal{EP}_k$. For example, using
a set of images whose camera centers are coplanar, an
epipolar-plane image can only be constructed for the
epipolar plane containing the camera centers. In other
words, only a single epipolar line from each image can
be analyzed using an epipolar-plane image. In order to
analyze all of the points in a set of images using
epipolar-plane images, all of the camera centers must be
collinear. This can be a serious limitation.



Figure 4: Epipolar image geometry.



2.3 Epipolar Images 

For our analysis we will define an epipolar image
$\mathcal{E}$ which is a function of one image and a
point in that image. An epipolar image is similar to an
epipolar-plane image, but has one critical difference
that ensures it can be constructed for every pixel in an
arbitrary set of images. Rather than use projections of
a single epipolar plane, we construct the epipolar image
from the pencil of epipolar planes defined by the line
$\ell^*$ through one of the camera centers $C^*$ and one
of the pixels $p^*$ in that image $\Pi^*$ (Figure 4).
$\Pi_e^i$ is the epipolar plane formed by $\ell^*$ and the
$i$th camera center $C_i$. Epipolar line $\ell_e^i$
contains all of the information about $\ell^*$ present in
$\Pi^i$. An epipolar-plane image is composed of
projections of a plane; an epipolar image is composed of
projections of a line. The cost of guaranteeing an
epipolar image can be constructed for every pixel is
that correspondence information is accumulated for only
one point $p^*$, instead of an entire epipolar line.




Figure 5: Set of points which form a possible
correspondence.

To simplify the analysis of an epipolar image we can
group points from the epipolar lines according to
possible correspondences (Figure 5). $P_1$ projects to
$p_1^i$ in $\Pi^i$; therefore $\{p_1^i\}$ has all of the
information contained in $\{\Pi^i\}$ about $P_1$. There is
also a distinct set of points $\{p_2^i\}$ for $P_2$;
therefore $\{p_j^i \mid \text{for a given } j\}$ contains
all of the possible correspondences for $P_j$. If $P_j$ is
a point on the surface of a physical object and it is
visible in $\{\Pi^i\}$ and $\Pi^*$, then measurements taken
at $p_j^i$ should match those taken at $p^*$ (Figure 6a).
Conversely, if $P_j$ is not a point on the surface of a
physical object then the measurements taken at $p_j^i$
are unlikely to match those taken at $p^*$ (Figures 6b
and 6c). Epipolar images can be viewed as tools for
accumulating evidence about possible correspondences of
$p^*$. A simple function of $j$ is used to build
$\{P_j \mid \forall i < j : \|P_i - C^*\|^2 < \|P_j - C^*\|^2\}$.
In essence, $\{P_j\}$ is a set of samples along $\ell^*$ at
increasing depths from the image plane.

Figure 6: Occlusion effects.
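
The grouping into possible correspondences can be sketched in a few lines. This is an illustration under simplifying assumptions that the paper does not make: every camera looks along +z with the same focal length, so a camera is described by its center alone. The names `sample_ray`, `project` and `correspondence_sets` are hypothetical.

```python
def sample_ray(center, pixel, focal, depths):
    """World points P_j on the ray l* through `pixel`, one per entry of `depths`."""
    x, y = pixel
    return [(center[0] + d * x / focal,
             center[1] + d * y / focal,
             center[2] + d) for d in depths]


def project(point, center, focal):
    """Pinhole projection into a camera at `center` looking along +z.

    Returns None when the point is not in front of the camera.
    """
    dz = point[2] - center[2]
    if dz <= 0.0:
        return None
    return (focal * (point[0] - center[0]) / dz,
            focal * (point[1] - center[1]) / dz)


def correspondence_sets(base_center, base_pixel, centers, focal, depths):
    """Return the samples {P_j} and, for each j, the projections {p_j^i} over i."""
    samples = sample_ray(base_center, base_pixel, focal, depths)
    columns = [[project(p, c, focal) for c in centers] for p in samples]
    return samples, columns


if __name__ == "__main__":
    base = (1.0, 2.0, 0.0)
    others = [(3.0, 2.0, 0.0), (0.0, 2.5, 0.0)]
    samples, columns = correspondence_sets(base, (0.1, -0.2), others, 1.0,
                                           [2.0, 4.0, 8.0])
    for j, column in enumerate(columns):
        print(j, samples[j], column)
```

Each inner list is one candidate correspondence set $\{p_j^i\}$; reading image values at those locations gives one column of the epipolar image.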



2.4 Analyzing Epipolar Images

An epipolar image $\mathcal{E}$ is constructed by
organizing

$\{F(p_j^i) \mid F() \text{ is a function of the image}\}$

into a two-dimensional array with $i$ and $j$ as the
vertical and horizontal axes respectively. Rows in
$\mathcal{E}$ are epipolar lines from different images;
columns form sets of possible correspondences ordered by
depth² (Figure 7). The quality $\nu(j)$ of the match
between column $j$ and $p^*$ can be thought of as evidence
that $p^*$ is the projection of $P_j$ and $j$ is its depth.
Specifically:



$$\nu(j) = \sum_i X\!\left(F(p_j^i), F(p^*)\right), \qquad (1)$$

where $F()$ is a function of the image and $X()$ is a cost
function which measures the difference between
$F(p_j^i)$ and $F(p^*)$. A simple case is

$$F(x) = \text{intensity values at } x$$

and

$$X(x_1, x_2) = -|x_1 - x_2|.$$

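Real cameras are finite, and $p_j^i$ may not be contained
in the image $\Pi^i$ ($p_j^i \notin \Pi^i$). Only terms for
which $p_j^i \in \Pi^i$ should be included in (1). To
correct for this, $\nu(j)$ is normalized, giving:

$$\nu(j) = \frac{\sum_{i \mid p_j^i \in \Pi^i} X\!\left(F(p_j^i), F(p^*)\right)}{\sum_{i \mid p_j^i \in \Pi^i} 1}. \qquad (2)$$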



Ideally, $\nu(j)$ will have a sharp, distinct peak at the
correct depth, so that

$$\arg\max_j \nu(j) = \text{the correct depth of } p^*.$$
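
Equation (2) and the peak search can be sketched as follows, using the simple intensity case $X(x_1, x_2) = -|x_1 - x_2|$. The data layout is an assumption made for illustration: the epipolar image is a list of rows (one per image $i$), each a list over $j$, with `None` marking a $p_j^i$ that falls outside image $i$. Returning `None` for a column with no contributing image is likewise a choice of this sketch, not something the paper specifies.

```python
def match_quality(epipolar_image, base_value):
    """nu(j) of equation (2) for every column j; None where no image contributes."""
    n_cols = len(epipolar_image[0])
    scores = []
    for j in range(n_cols):
        # Keep only the terms for which p_j^i lies inside image i.
        column = [row[j] for row in epipolar_image if row[j] is not None]
        if not column:
            scores.append(None)
            continue
        total = sum(-abs(value - base_value) for value in column)
        scores.append(total / len(column))
    return scores


def best_depth_index(scores):
    """Index j of the global maximum of nu(j), ignoring columns without evidence."""
    valid = [j for j, s in enumerate(scores) if s is not None]
    return max(valid, key=lambda j: scores[j]) if valid else None


if __name__ == "__main__":
    # Three images, four depth samples; the intensity at p* is 90.
    epipolar_image = [
        [10, 40, 90, None],
        [20, 50, 91, 30],
        [None, 60, 89, 70],
    ]
    scores = match_quality(epipolar_image, 90)
    print(scores, best_depth_index(scores))
```

In this toy array only the third column agrees with the base intensity in every image, so it is the one selected as the depth of $p^*$.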



As the number of elements in
$\{p_j^i \mid \text{for a given } j\}$ increases, the
likelihood increases that $\nu(j)$ will be large when
$P_j$ lies on a physical surface and small when it does
not. Occlusions do not produce peaks at incorrect depths
or false positives³. They can, however, cause false
negatives



2 The depth of Pj can be trivially calculated from j, 
therefore we consider j and depth to be interchangeable. 
3 Except possibly in adversarial settings. 




Figure 7: Constructing an epipolar image $\mathcal{E}(\Pi^*, p^*)$, with $i$ (image) on the vertical axis.




Figure 8: False negative caused by occlusion. 



or the absence of a peak at the correct depth (Figure
8). A false negative is essentially a lack of evidence
about the correct depth. Occlusions can reduce the
height of a peak, but a dearth of concurring images is
required to eliminate the peak. Globally this produces
holes in the data. While less than ideal, this is not a
major issue and can be addressed in two ways: removing
the contribution of occluded views, and adding
unoccluded views by acquiring more images.



Figure 9: Exclusion region for $P_j$.

A large class of occluded views can be eliminated quite
simply. Figure 9 shows a point $P_j$ and its normal
$n_j$. Images with camera centers in the hashed half
space cannot possibly view $P_j$. $n_j$ is not known a
priori, but the fact that $P_j$ is visible in $\Pi^*$
limits its possible values. This range of values can
then be sampled and used to eliminate occluded views
from $\nu(j)$. Let $\alpha$ be an estimate of $n_j$ and
$\widehat{C_iP_j}$ be the unit vector along the ray from
$C_i$ to $P_j$; then $P_j$ can only be visible if
$\widehat{C_iP_j} \cdot \alpha < 0$.
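
The half-space test is a single dot product. A minimal sketch, with hypothetical function names and plain tuples standing in for 3-vectors:

```python
import math


def unit(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)


def may_see(camera_center, point, alpha):
    """True if a camera at `camera_center` can possibly view `point`.

    `alpha` is an estimate of the surface normal at `point`; the view is kept
    only when the unit ray from the camera to the point opposes that normal.
    """
    ray = unit(tuple(p - c for p, c in zip(point, camera_center)))
    return sum(r * a for r, a in zip(ray, alpha)) < 0.0


if __name__ == "__main__":
    point = (0.0, 0.0, 10.0)
    alpha = (0.0, 0.0, -1.0)  # surface assumed to face the -z direction
    for center in [(0.0, 0.0, 0.0), (0.0, 0.0, 20.0), (5.0, 0.0, 10.0)]:
        print(center, may_see(center, point, alpha))
```

A camera lying exactly in the tangent plane gives a dot product of zero and is rejected, in keeping with the strict inequality.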



If the vicinity of $\{\Pi^i\}$ is modeled (perhaps
incompletely) by previous reconstructions, then this
information can be used to improve the current
reconstruction. Views for which the depth⁴ $d(p_j^i)$ at
$p_j^i$ is less than the distance from $\Pi^i$ to $P_j$ can
also be eliminated. For example, if $M_1$ and $M_2$ have
already been reconstructed, then $i \in \{1, 2, 3\}$ can be
eliminated from $\nu(j)$ (Figure 8). The updated function
becomes:
4 Distance from $\Pi^i$ to the closest previously
reconstructed object or point along the ray starting at
$C_i$ in the direction of $p_j^i$.



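$$\nu(j,\alpha) = \frac{\sum_{i \in S} X\!\left(F(p_j^i), F(p^*)\right)}{\sum_{i \in S} 1} \qquad (3)$$

where

$$S = \left\{\, i \;\middle|\; p_j^i \in \Pi^i,\ \ \widehat{C_iP_j} \cdot \alpha < 0,\ \ d(p_j^i) \ge \text{distance from } \Pi^i \text{ to } P_j \,\right\}.$$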
Then, if sufficient evidence exists,

$$\arg\max_{j,\alpha} \nu(j,\alpha) \;\Rightarrow\; \begin{cases} j = \text{depth of } p^* \\ \alpha = \text{an estimate of } n_j \end{cases}$$
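
The joint search over depth and orientation can be sketched as below. This is an illustration under stated assumptions, not the authors' code: each view of a depth sample is a dictionary with hypothetical keys `value` ($F(p_j^i)$, or `None` when $p_j^i$ is outside the image), `ray` (the unit vector from $C_i$ to $P_j$), `free_depth` ($d(p_j^i)$, infinite when nothing has been reconstructed along the ray) and `distance` (distance to $P_j$). The unweighted average over the admissible views and the intensity cost $-|x_1 - x_2|$ are used for simplicity.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def admissible_views(views, alpha):
    """Indices of the views kept for one depth sample and one normal estimate."""
    keep = []
    for i, view in enumerate(views):
        if view["value"] is None:                  # p_j^i falls outside image i
            continue
        if dot(view["ray"], alpha) >= 0.0:         # camera is behind the surface
            continue
        if view["free_depth"] < view["distance"]:  # known geometry blocks the ray
            continue
        keep.append(i)
    return keep


def nu(views, base_value, alpha):
    """Average match quality over the admissible views; None if there are none."""
    kept = admissible_views(views, alpha)
    if not kept:
        return None
    return sum(-abs(views[i]["value"] - base_value) for i in kept) / len(kept)


def best_depth_and_normal(samples, base_value, alphas):
    """Return (score, j, alpha) maximizing nu over all depths and normal estimates."""
    best = None
    for j, views in enumerate(samples):
        for alpha in alphas:
            score = nu(views, base_value, alpha)
            if score is not None and (best is None or score > best[0]):
                best = (score, j, alpha)
    return best


if __name__ == "__main__":
    inf = math.inf
    front, back = (0.0, 0.0, 1.0), (0.0, 0.0, -1.0)
    samples = [
        [{"value": 100.0, "ray": front, "free_depth": inf, "distance": 5.0},
         {"value": 20.0, "ray": back, "free_depth": inf, "distance": 5.0}],
        [{"value": 60.0, "ray": front, "free_depth": inf, "distance": 6.0},
         {"value": 70.0, "ray": back, "free_depth": inf, "distance": 6.0}],
    ]
    print(best_depth_and_normal(samples, 100.0, [(0.0, 0.0, -1.0), (0.0, 0.0, 1.0)]))
```

In this toy data a plain average over all views would favour the second sample (an average of -35 against -40), whereas excluding the view that cannot see the surface lets the first sample win with a perfect match.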



One way to eliminate occlusions such as those shown in
Figure 8 is to process the set of epipolar images
$\{\mathcal{E}_k\}$ in a best-first fashion. This is
essentially building a partial model and then using that
model to help analyze the difficult spots.
$\nu(j,\alpha)$ is calculated using purely local
operations. Another approach is to incorporate global
constraints.

3 Results 

Synthetic imagery was used to explore the
characteristics of $\nu(j)$ and $\nu(j,\alpha)$. A CAD
model of Technology Square, the four-building complex
housing our laboratory, was built by hand. The locations
and geometries of the buildings were determined using
traditional survey techniques. Photographs of the
buildings were used to extract texture maps which were
matched with the survey data. This three-dimensional
model was then rendered from 100 positions along a "walk
around the block" (Figure 10). From this set of images,
a $\Pi^*$ and $p^*$ were chosen and an epipolar image
$\mathcal{E}$ constructed. $\mathcal{E}$ was then analyzed
using two match functions:
and

$$\nu(j,\alpha) = \frac{\sum_{i \in S} \left(\widehat{C_iP_j} \cdot \alpha\right) X\!\left(F(p_j^i), F(p^*)\right)}{\sum_{i \in S} \widehat{C_iP_j} \cdot \alpha} \qquad (5)$$
where⁵

$$F(x) = \mathrm{hsv}(x) = [h(x), s(x), v(x)]^T \qquad (6)$$



5 hsv is the well known hue, saturation and value color 
model. 
Figure 10: Examples of the rendered model.



$$X\!\left([h_1, s_1, v_1]^T, [h_2, s_2, v_2]^T\right) = -\left(\frac{s_1 + s_2}{2}\,\bigl(1 - \cos(h_1 - h_2)\bigr) + (2 - s_1 - s_2)\,|v_1 - v_2|\right). \qquad (7)$$
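
A colour cost of this kind can be sketched with Python's standard `colorsys` module. The sketch takes (7) to be the negated sum of a saturation-weighted hue term and a value term; the exact weights, the overall sign, and the conventions that hue is an angle in radians and that saturation and value lie in [0, 1] are assumptions of this example, and the function names are hypothetical.

```python
import colorsys
import math


def hsv_feature(rgb):
    """F(x): hue (radians), saturation and value of an RGB triple in [0, 1]."""
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    return (2.0 * math.pi * h, s, v)


def hsv_cost(a, b):
    """Cost between two (h, s, v) features; 0 for identical colours, negative otherwise."""
    h1, s1, v1 = a
    h2, s2, v2 = b
    # Hue differences matter most when both colours are saturated.
    hue_term = 0.5 * (s1 + s2) * (1.0 - math.cos(h1 - h2))
    # Value differences matter most when the colours are close to grey.
    value_term = (2.0 - s1 - s2) * abs(v1 - v2)
    return -(hue_term + value_term)


if __name__ == "__main__":
    red, green = hsv_feature((1.0, 0.0, 0.0)), hsv_feature((0.0, 1.0, 0.0))
    dark, light = hsv_feature((0.2, 0.2, 0.2)), hsv_feature((0.8, 0.8, 0.8))
    print(hsv_cost(red, red), hsv_cost(red, green), hsv_cost(dark, light))
```

The cosine makes the hue term insensitive to wrap-around at the ends of the hue circle, and the saturation weights let the measure fall back on brightness for grey pixels, whose hue is meaningless.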

Figures 11 and 12 show a base image $\Pi^*$ with $p^*$
marked by a cross. Under $\Pi^*$ is the epipolar image
$\mathcal{E}$ generated using the remaining 99 images.
Below $\mathcal{E}$ are the matching functions $\nu(j)$
(4) and $\nu(j,\alpha)$ (5). The horizontal scale, $j$ or
depth, is the same for $\mathcal{E}$, $\nu(j)$ and
$\nu(j,\alpha)$. The vertical axis of $\mathcal{E}$ is the
image index, and of $\nu(j,\alpha)$ is a coarse estimate
of the orientation $\alpha$ at $P_j$. The vertical axis of
$\nu(j)$ has no significance; it is a single row that has
been replicated to increase visibility. To the right,
$\nu(j)$ and $\nu(j,\alpha)$ are also shown as
two-dimensional plots⁶.



Figure 11a shows the epipolar image that results when
the upper left-hand corner of the foreground building is
chosen as $p^*$. Near the bottom of $\mathcal{E}$,
$\ell_e^i$ is close to horizontal, and $p_j^i$ is the
projection of blue sky everywhere except at the building
corner. The corner points show up in $\mathcal{E}$ near
the right side as a vertical streak. This is as expected
since the construction of $\mathcal{E}$ places the
projections of $P_j$ in the same column. Near the middle
of $\mathcal{E}$, the long side-to-side streaks result
because $P_j$ is occluded, and near the top the large
black region is produced because $p_j^i \notin \Pi^i$.
Both $\nu(j)$ and $\nu(j,\alpha)$ have a sharp peak⁷ that
corresponds to the vertical stack of corner points. This
peak occurs at a depth of 2375 units ($j = 321$) for
$\nu(j)$ and a depth of 2385 ($j = 322$) for
$\nu(j,\alpha)$. The actual distance to the corner is
2387.4 units. The reconstructed world coordinates of
$p^*$ are $[-1441, -3084, 1830]^T$ and
$[-1438, -3077, 1837]^T$ respectively. The actual
coordinates⁸ are $[-1446, -3078, 1846]^T$.

Figure 11b shows the epipolar image that results when a
point just on the dark side of the front left edge of
the building is chosen as $p^*$. Again both $\nu(j)$ and
$\nu(j,\alpha)$ have a single peak that agrees well with
the depth obtained using manual correspondence. This
time, however, the peaks are asymmetric and have much
broader tails. This is caused by the high contrast
between the bright and dark faces of the building and
the lack of contrast within the dark face. The peak in
$\nu(j,\alpha)$ is slightly better than the one in
$\nu(j)$.

Figure 11c shows the epipolar image that results when a
point just on the bright side of the front left edge of
the building is chosen as $p^*$.



6 Actually, $\sum_\alpha \nu(j,\alpha) / \sum_\alpha 1$ is plotted for $\nu(j,\alpha)$.



7 White indicates minimum error, black maximum. 

8 Some of the difference may be due to the fact that p* 
was chosen by hand and might not be the exact projection 
of the corner. 
Figure 11: $\Pi^*$, $p^*$, $\mathcal{E}$, $\nu(j)$ and $\nu(j,\alpha)$.
Figure 12: $\Pi^*$, $p^*$, $\mathcal{E}$, $\nu(j)$ and $\nu(j,\alpha)$.



This time $\nu(j)$ and $\nu(j,\alpha)$ are substantially
different. $\nu(j)$ no longer has a single peak. The
largest peak occurs at $j = 370$ and the next largest at
$j = 297$. The manual measurement agrees with the peak at
$j = 297$. The peak at $j = 370$ corresponds to the point
where $\ell^*$ exits the back side of the building.
$\nu(j,\alpha)$, on the other hand, still has a single
peak, clearly indicating the usefulness of estimating
$\alpha$.

In Figure 12a, $p^*$ is a point from the interior of a
building face. There is a clear peak in $\nu(j,\alpha)$
that agrees well with manual measurements and is better
than that in $\nu(j)$. In Figure 12b, $p^*$ is a point on
a building face that is occluded (Figure 8) in a number
of views. Both $\nu(j)$ and $\nu(j,\alpha)$ produce fairly
good peaks that agree with manual measurements. In
Figure 12c, $p^*$ is a point on a building face with very
low contrast. In this case, neither $\nu(j)$ nor
$\nu(j,\alpha)$ provides clear evidence about the correct
depth. The actual depth occurs at $j = 386$. Both $\nu(j)$
and $\nu(j,\alpha)$ lack sharp peaks in large regions
with little or no contrast or excessive occlusion.
Choosing $p^*$ as a sky or ground pixel will produce a
nearly constant $\nu(j)$ or $\nu(j,\alpha)$.

To further test our method, we reconstructed the depth
of a region in one of the images (Figure 13). For each
pixel inside the black rectangle the global maximum of
$\nu(j,\alpha)$ was taken as the depth of that pixel.
Figure 14a shows the depth for each of the 3000 pixels
reconstructed




Figure 13: Reconstructed region. 

plotted against the $x$ image coordinate of the pixel.
Slight thickening is caused by the fact that depth
changes slightly with the $y$ image coordinate. The
clusters of points at the left end (near a depth of
7000) and at the right end correspond to sky points. The
actual depth for each pixel



was calculated from the CAD model. Figure 14b shows the
actual depths (in grey) overlaid on top of the
reconstructed values. Figure 15 shows the same data
plotted in world coordinates⁹. The actual building faces
are drawn in grey, and the camera position is marked by
a grey line extending from the center of projection in
the direction of the optic axis. The reconstruction
shown (Figures 14 and 15) was performed purely locally
at each pixel. Global constraints such as ordering or
smoothness were not imposed, and no attempt was made to
remove low confidence depths or otherwise post-process
the global maximum of $\nu(j,\alpha)$.



4 Conclusions 

This paper describes a method for generating dense depth
maps directly from large sets of images taken from
arbitrary positions. The algorithm presented uses only
local calculations and is simple and accurate. Our
method builds, then analyzes, an epipolar image to
accumulate evidence about the depth at each image pixel.
This analysis produces an evidence versus depth and
surface normal distribution that in many cases contains
a clear and distinct global maximum. The location of
this peak determines the depth, and its shape can be
used to estimate the error. The distribution can also be
used to perform a maximum likelihood fit of models to
the depth map. We anticipate that the ability to perform
maximum likelihood estimation from purely local
calculations will prove extremely useful in constructing
three-dimensional models from large sets of images.
9 All coordinates have been divided by 1000 to simplify
the plots.
Figure 14: Reconstructed and actual depth maps.



Figure 15: Reconstructed and actual world points.